# CUDA MODE: Lecture 1

Mark Saroufim

## Logistics

Hosts: Andreas Köpf, Thomas Viehmann, Mark Saroufim

1 per 2 week on a CUDA topic: textbook chapter, pair programming session or project

Target audience is torch programmers tired of CUDA tutorial hell

Textbook: Programming Massively Parallel Processors

Additional resources here in <u>resource-stream</u>

All communication will happen on our **Discord** 

Sessions will be recorded on <a href="https://www.youtube.com/@CUDAMODE">https://www.youtube.com/@CUDAMODE</a>

### Goal of Lecture 1

- Integrate a CUDA kernel inside a pytorch program
- 2. Learn how to profile it

Most of the code is here https://github.com/msaroufim/cudamc ture1

I believe thing I see



# Start with something simple



```
def time pytorch function(func, input):
           # CUDA IS ASYNC so can't use python time module
          start = torch.cuda.Event(enable_timing=True)
           end = torch.cuda.Event(enable timing=True)
          # Warmup
          for in range(5):
               func(input)
          start.record()
          func(input)
          end.record()
          torch.cuda.synchronize()
          return start.elapsed time(end)
       b = torch.randn(10000, 10000).cuda()
       def square 2(a):
          return a * a
       def square 3(a):
          return a ** 2
       time pytorch function(torch.square, b)
       time pytorch function(square 2, b)
       time_pytorch_function(square_3, b)
36
       print("======")
       print("Profiling torch.square")
       print("======")
```

# PyTorch profiler

Memcpy HtoD (Pageable -> Device)

- Host to device copy
- Pageable memory is on host but can be copied freely in out of RAM



https://github.com/msaroufim/cudamodelecture1/blob/main/pt\_profiler.py

#### What can we learn?

Aten::square is a call to aten:pow

A cuda kernel gets launched called native\_vectorized\_elementwise\_kernel<4, ..>

4 is the number of blocks

https://github.com/pytorch/pytorch/blob/main/caffe2/utils/math/elementwise.cu



## Custom cpp extensions

```
hello_load_inline.py > ...
 1 ∨ import torch
     from torch.utils.cpp_extension import load_inline
   v cpp_source = """
     std::string hello_world() {
        return "Hello World!";
     1111111
   wmy_module = load_inline(
11
          name='my_module',
12
          cpp_sources=[cpp_source],
          functions=['hello_world'],
13
          verbose=True
14
15
17
     print(my_module.hello_world())
```

# Codegen

```
#include <torch/extension.h>
std::string hello_world() {
 return "Hello World!";
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("hello_world", torch::wrap_pybind_function(hello_world), "hello_world");
```

# How to run a CUDA kernel from pytorch

https://github.com/msaroufim/cudamodelecture1/blob/main/load\_inline.py

### numba

https://gist.github.com/msaroufim/6673c9e5c0c3d58740472601eac6d4df

## Integrate a triton kernel

```
def square(x):
   n rows, n cols = x.shape
   # The block size is the smallest power of two greater than the number of columns in `x`
   BLOCK SIZE = triton.next power of 2(n cols)
   # Another trick we can use is to ask the compiler to use more threads per row by
   # increasing the number of warps (`num_warps`) over which each row is distributed.
   # You will see in the next tutorial how to auto-tune this value in a more natural
   # way so you don't have to come up with manual heuristics yourself.
   num warps = 4
   if BLOCK SIZE >= 2048:
       num_warps = 8
   if BLOCK SIZE >= 4096:
       num warps = 16
   # Allocate output
   y = torch.empty like(x)
   # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per row o
   # f the input matrix
   square_kernel[(n_rows, )](
       у,
       х,
       x.stride(0),
       y.stride(0),
       n_cols,
       num_warps=num_warps,
       BLOCK SIZE=BLOCK SIZE,
   return y
```

#### Triton kernel on A10G



https://gist.github.com/msaroufim/8649307ecdbb9309ced2d5106073bc0c

# torch.compile



## Results on 4090



# After fixing the block size to 1024



## Triton has a debugger now

triton.jit(interpret=True)

Almost everything is a WrappedTensor so inspect variables with var\_name.tensor

https://gist.github.com/msaroufim/f849df30687708782e0269c4b42264b1

#### Look at PTX

https://github.com/msaroufim/cud amodelecture1/blob/main/square kernel.ptx

8 registers with a self multiplication for input

8 registers for output

This means Triton is using 8 registers for storing inputs and another 8 registers for storing outputs.

.loc 1 19 26

| mul.f32 | %f9, %f1, %f1;  |
|---------|-----------------|
| mul.f32 | %f10, %f2, %f2; |
| mul.f32 | %f11, %f3, %f3; |
| mul.f32 | %f12, %f4, %f4; |
| mul.f32 | %f13, %f5, %f5; |
| mul.f32 | %f14, %f6, %f6; |
| mul.f32 | %f15, %f7, %f7; |
| mul.f32 | %f16, %f8, %f8; |

#### Cheat: Generate a triton kernel

TORCH\_LOGS="output\_code" python compile\_square.py

torch.compile(torch.square))

```
import triton
import triton.language as tl
from torch. inductor.ir import ReductionHint
from torch. inductor.ir import TileHint
from torch. inductor.triton heuristics import AutotuneHint, pointwise
from torch. inductor.utils import instance descriptor
from torch. inductor import triton helpers
@pointwise(size hints=[2097152], filename= file , meta={'signature': {0: '*fp32', 1: '*fp32', 2:
@triton.jit
def triton (in ptr0, out ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 1423763
    xoffset = tl.program id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel</pre>
    x0 = xindex
    tmp0 = tl.load(in ptr0 + (x0), xmask)
    tmp1 = tmp0 * tmp0
    tl.store(out_ptr0 + (x0), tmp1, xmask)
```

# ncu profiler

ncu python train.py

ncu --set full -o output \$(which python) train.py

https://github.com/msaroufim/cudamodelecture1/blob/main/ncu\_logs

Contains actionable hints like

OPT This kernel grid is too small to fill the available resources on this device, resulting in only 0.4 full waves across all SMs. Look at Launch Statistics for more details.



#### Zoom in

Tail effect + Achieved occupancy 70%: Try padding (We can control)

Long scoreboard stalls: coalesce, use shared memory (Controlled by Triton:()

|                          | CUDA   | TRITON    |
|--------------------------|--------|-----------|
| Memory Coalescing        | Manual | Automatic |
| Shared Memory Management | Manual | Automatic |
| Scheduling (Within SMs)  | Manual | Automatic |
| Scheduling (Across SMs)  | Manual | Manual    |
|                          |        |           |

Compiler optimizations in CUDA vs Triton.



